The adapted Activity-By-Contact model for enhancer–gene assignment and its application to single-cell data

Abstract Motivation Identifying regulatory regions in the genome is of great interest for understanding the epigenomic landscape in cells. One fundamental challenge in this context is to find the target genes whose expression is affected by the regulatory regions. A recent successful method is the Activity-By-Contact (ABC) model which scores enhancer–gene interactions based on enhancer activity and the contact frequency of an enhancer to its target gene. However, it describes regulatory interactions entirely from a gene’s perspective, and does not account for all the candidate target genes of an enhancer. In addition, the ABC model requires two types of assays to measure enhancer activity, which limits the applicability. Moreover, there is neither implementation available that could allow for an integration with transcription factor (TF) binding information nor an efficient analysis of single-cell data. Results We demonstrate that the ABC score can yield a higher accuracy by adapting the enhancer activity according to the number of contacts the enhancer has to its candidate target genes and also by considering all annotated transcription start sites of a gene. Further, we show that the model is comparably accurate with only one assay to measure enhancer activity. We combined our generalized ABC model with TF binding information and illustrated an analysis of a single-cell ATAC-seq dataset of the human heart, where we were able to characterize cell type-specific regulatory interactions and predict gene expression based on TF affinities. All executed processing steps are incorporated into our new computational pipeline STARE. Availability and implementation The software is available at https://github.com/schulzlab/STARE Contact marcel.schulz@em.uni-frankfurt.de Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Unravelling the mechanisms behind gene expression regulation is a central task in epigenomics. Enhancers are key players in this process. They are accessible regions in the genome, which can be bound by transcription factors (TFs) in a sequence-specific fashion. Those TFs have a variety of functions: recruit other cofactors, remodel chromatin conformation, cause changes in epigenetic modifications or directly interact with the transcription machinery, affecting gene expression (Gonzalez, 2016;Lambert et al., 2018;Pabo and Sauer, (Schoenfelder and Fraser, 2019;Yao et al., 2015). Many approaches exist to predict target genes of enhancers, for example, using correlation of enhancer activity and gene expression (Gao and Qian, 2019;Schmidt et al., 2021;The FANTOM Consortium et al., 2014), or correlation of accessibility across samples (Pliner et al., 2018). Another possibility is to include chromatin contact data, for example, Hi-C (Lieberman-Aiden et al., 2009), to call chromatin loops for annotating enhance-gene links (Rao et al., 2014;Schmidt et al., 2020;Yi et al., 2021). Although loops correlate with gene expression, their anchors only cover a fraction of active promoters and enhancers and their removal impacts expression of only few genes (Nora et al., 2017;Rao et al., 2017;Schoenfelder and Fraser, 2019). Fulco et al. (2019) combined measurements of enhancer activity with chromatin contact data and proposed the Activity-By-Contact (ABC) model. The assumption is that active enhancers that frequently contact a gene's promoter are more likely to affect a gene's regulation. The ABC model requires DNase-seq, H3K27ac ChIP-seq data and a Hi-C matrix. The Hi-C matrix can be substituted with a matrix averaged over multiple cell types, or with a quantitative function describing the distance-contact relationship (Fulco et al., 2019). The original ABC-model formulation is entirely gene-centric, which means it does not take the candidate target genes of an enhancer into account.
We propose a generalized ABC (gABC) score with two adaptations: first, it describes enhancer activity in a gene-specific manner and second, it uses the information of all annotated transcription start sites (TSSs) of a gene. Further, we could show that, instead of using both DNase-and H3K27ac ChIP-seq data for the ABC model, one assay for measuring enhancer activity yields a similar accuracy. We validated the adaptations on three datasets of experimentally tested enhancer-gene interactions in K562 cells and on expression quantitative trait loci (eQTL) data from different tissues. We developed STARE, a fast implementation of the ABC score with an approach to quantify TF binding affinity for genes. STARE can compute enhancer-gene interactions from single-cell chromatin accessibility data, illustrated on data of the human heart. With only one data modality at single-cell resolution, we identified and characterized cell type-specific (CS) regulatory interactions.

ABC score
We use the terms enhancer and regulatory region interchangeably. The principle of the ABC model is that an enhancer, which is highly active and has a high contact frequency with a gene, is likely to regulate it (Fulco et al., 2019). The ABC score represents the relative contribution of an enhancer r to the regulation of gene g, measured by the enhancer's activity A r and its contact frequency C r;g with the promoter of g. For each candidate enhancer, the activity is multiplied with the contact frequency and this product is taken relative to the sum over all candidate enhancers R g in a predefined window around the TSS of a gene: By definition the scores per gene sum up to 1. In practice, a cutoff is used to select valid interactions. The ABC model allows a many-to-many relationship: a gene may link to multiple enhancers and an enhancer may link to multiple genes. We tested three types of epigenomic assays to approximate A r by counting sequencing reads in the enhancer: DNase-seq, ATAC-seq or H3K27ac ChIP-seq. The contact C r;g is taken from a normalized chromatin contact frequency matrix. A pseudocount is added to each C r;g , so that all candidates R g are taken into account (see Supplementary Material).

gABC score
We present a 2-fold adaptation of the ABC score, to account for all candidate target genes of an enhancer and for all TSSs of a gene. Regarding the former, the activity of an enhancer conceptually represents all regulatory interactions an enhancer has. However, an enhancer can interact with different genes and not all genes are equally likely to be brought into vicinity of the enhancer. We assume that an enhancer's regulatory input is a function of the number of contacts with all potential target genes. Target genes that are often in contact with the enhancer would receive more regulatory input than genes with fewer contacts. Thus, the activity A r could be denoted in a relative manner: where A r;g denotes the relative regulatory activity of enhancer r towards gene g and G r denotes the set of all genes that are located within a predefined window around r. Since the values of A r;g are not known, we propose an approximation by using chromatin contacts: We approximate the regulatory activity A r;g by the relative fraction of its contact C r;g to the TSS of g to the sum over the contacts to all genes G r in a window around the enhancer. Thus, the ABC score becomes In comparison to Equation (1), the activity A r of an enhancer is replaced with its gene-specific relative activity A r;g , see Equation (3). Thus, this adapted score does not only function in a gene-centric way, but also in an enhancer-centric way, by adapting the activity to the relative number of contacts of an enhancer's candidate target genes. This adaptation therefore uses a reduced enhancer activity estimate, in particular for enhancers in contact with many genes, and prevents them from being accounted with their full activity for all genes within the window.
In addition, instead of considering only one TSS per gene, for example, the 5 0 TSS, we propose to include all TSSs of a gene in the following fashion: where TSS g are all annotated TSSs of gene g. That allows to include the contact information to more potentially relevant transcription sites and omits the selection of an individual TSS. We name the score with these two changes the gABC score.

Validation on CRISPR screens
To validate our gABC score and to test different assays for measuring enhancer activity, we examined the performance on experimentally validated enhancer-gene interactions. We made use of three CRISPRi screens for K562 cells. Gasperini  We also compared gABC with Enformer, a sequence-based deep learning model predicting gene expression and chromatin states, whose characteristic is an increased information flow between distal sequence positions (Avsec et al., 2021). To quantify enhancer-gene interactions with Enformer we compared the expression estimate upon in silico mutagenesis of the enhancer region, by either replacing 2 kilobase (kb) centred at the enhancer with neutral nucleotides, or by shuffling the sequence for 25 iterations (Karollus et al., 2022). Due to the size of Enformer's receptive field, we limited the comparison to interactions with 96 kb distance.

TF affinities and summarization on gene level
In addition to enhancer-gene interactions, STARE aims to describe a TF's regulatory impact on a gene ( Supplementary Fig. S1). There are two steps required. First, genomic regions that influence the regulation of a gene have to be identified. This can be done via ABC scoring as described before, or in a more simplistic approach by taking all open regions within a defined window around the gene's TSS into account. Utilizing the ABC model allows STARE to function with any desired window. Second, the affinities of TFs to the identified regions have to be quantified. Instead of relying on calling TF binding sites, we use the tool TRAP, which calculates relative binding affinities for TFs in a genomic region. TRAP describes TF binding with a biophysical model to predict the number of TF molecules that bind in a sequence (Roider et al., 2007). The higher the affinity, the more likely a TF is to bind. Retaining low binding affinities can hold valuable information (Kribelbauer et al., 2019;Schmidt et al., 2016) and omits selection of an arbitrary threshold. For all analyses presented in this manuscript, we used a non-redundant collection of 818 human TF motifs in the form of position frequency matrices (PFMs) from JASPAR 2022, HOCOMOCO v11 and the work of Kheradpour and Kellis (2014) (Castro-Mondragon et al., 2022;Kulakovskiy et al., 2018). When converting PFMs to positionspecific energy matrices required by TRAP, we take the average nucleotide content of the candidate regulatory regions as background.
To summarize the TF affinities in enhancers per gene, we combine them with the predicted enhancer-gene interactions. The summarization depends on how the region-gene mapping was done. For the window-based approach, TF affinities in all open regions around the gene's TSS are summed: where af g;tf is the affinity of TF tf summarized for gene g. R g is the set of all open regions r that were located within the window around g. af r;tf is the affinity of tf in r, ml tf is the motif length of tf and A r is the activity for r. The affinity is corrected for the distance d r;g of r to the TSS of g by an exponential decay function, as proposed by Ouyang et al. (2009), where d 0 is set to a constant of 5000 bp. When the gABC score was used to assign regions to genes, the summarization changes, as there is more epigenomic information available. Regions close to the TSS ( 2500 bp) are always included, independently of their gABC score, as they are very informative for the expression regulation of a gene . They are also scaled differently: R g is the set of regions that was linked to the gene with the gABC score and A r;g is the adapted activity (Equation 3). For regions close to the TSS, the base activity A r is corrected with the exponential decay function, as the contact frequency would likely be the contact of the region with itself and thus A r;g could be erroneous. When using the regular ABC score, the affinity scaling changes as follows: The regions close to the TSS are scaled in the same way as before. For all the other regions, we divide the contact frequency C r;g by the maximum contact that was measured for all region-gene pairs C max . The reasoning is to incorporate the contact frequency and to have both multipliers for A r in the range of [0, 1]. Essentially, the activity multipliers for gABC and ABC differ only in how the contact is scaled: for gABC relative to all gene contacts of the respective enhancer and for ABC to the contacts of all enhancer-gene pairs.
In addition to the TF affinities, we report three additional gene features: the number of regions considered per gene, the regions' average distance to the TSS and the regions' average base pair length, as all three can be predictive of gene expression (Schmidt and Schulz, 2019).

Application to single-cell data
To test the capability of gABC-scored enhancer-gene interactions combined with TF affinities to capture regulatory CS information, we analysed a single-cell dataset of the human heart from Hocker et al. (2021), providing single-nuclei (sn) ATAC-seq, as well as snRNA-seq data. The candidate enhancers were pooled and ATACseq reads per kilobase per million reads (RPKM) was measured for each cell type. For chromatin contacts, we tested H3K27ac HiChIP data of the left ventricle (Anene-Nzelu et al., 2020) as well as an average Hi-C matrix (Fulco et al., 2019). Regions known to accumulate an anomalous amount of sequencing reads were excluded (Amemiya et al., 2019;The ENCODE Project Consortium, 2012). As we had enhancer activity and enhancer contact data at hand, we determined enhancer-gene interactions for each cell type with the gABC score (cut-off 0.02, 5 MB window around all annotated TSS) and summarized TF affinities for each gene in each cell type based on those interactions. To assess the predictability of gene expression by gene-TF affinities, we used INVOKE (Schmidt et al., 2017), which implements a linear regression model based on gene-TF scores, and selects TFs that are most predictive of gene expression. We trained prediction models for multiple set-ups for each cell type to compare their prediction accuracy. We repeated this process for CS genes, defined as genes where the z-score of expression between cell types was !2 and transcripts per million (TPM) !0.5. The gene-TF matrices were limited to TFs expressed in a cell type (TPM ! 0.5). Schmidt et al. (2020) applied INVOKE on similar data types for bulk data, but restricted information of distant enhancers to those connected to promoters via loops. With the ABC approach we have a finer resolution of regulatory interactions and can integrate the contact frequency into the summarization of TF affinities.

Comparison to co-accessibility analysis
A common approach to identify regulatory interactions in single-cell ATAC-seq data is to call co-accessible regions. Hocker et al. (2021) ran Cicero (Pliner et al., 2018) on their snATAC-seq data to derive pairs of regions with correlated accessibility, limited to a distance of 250 kb. Whenever either side of a co-accessible region pair overlapped a 400-bp window around any annotated TSS of a gene, we considered it as an enhancer-gene interaction for that gene. We tested how informative the resulting 62 384 co-accessible interactions are for our gene expression prediction model. Summarization of TF affinities was done according to Equation (6) and the affinities of regions close to the TSS ( 2500 bp) were included, as described in Section 2.4.

Intersection with eQTLs
Further, we compared the agreement of the regular ABC and gABC score with eQTL data. We intersected ABC-scored interactions from four different heart cell types (Hocker et al., 2021) and K562 cells with eQTL-gene pairs of matching samples from the GTEx portal (The GTEx Consortium, 2020). We used high confidence eQTL-gene pairs from three different fine-mapping approaches, namely CAVIAR, CaVEMaN and DAP-G (Brown et al., 2017;Hormozdiari et al., 2014;Wen et al., 2016). For each set of eQTLs, we defined the enhancer-gene pairs that were supported by the eQTLs, meaning all candidate enhancers of a cell type with a variant, where the affected target gene was within the chosen ABC window size. Then, we compared the fraction of those eQTL-supported enhancer-gene pairs that we could also find among a variable number of highest scored ABC/gABC interactions (Recall). As we also had the co-accessibility analysis on the heart data, we examined how many eQTL-gene pairs the resulting interactions recover.

gABC score improves interaction prediction
We propose a gABC score (Equation 5), where the activity of an enhancer is described in a gene-specific manner ( Fig. 1c; Equation 3), and all annotated TSSs of a gene are considered. On all validation datasets and for all combinations of activity measurements, the gABC score outperformed the regular ABC score (P-value % 0.0005 Wilcoxon signed-rank test) ( Fig. 1a and d and Table 1). The difference was more pronounced in the Gasperini and Schraivogel validation data. Each of the two adaptations of the gABC score individually increased the area under the precision recall curve (AUPRC) compared with the regular ABC across activity assays, with the gene-specific activity providing an average gain of 0.026, and including all TSSs giving an average improvement of 0.053. Taken together, the gABC score yielded on average a 0.107 higher AUPRC (Supplementary Table S1). The areas under the ROC curves for gABC were significantly higher in 10 out of 12 pairwise comparisons across CRISPRi screens and activity assays (Supplementary Table S5). Using an average Hi-C matrix changed the accuracy marginally for both ABC scores (Supplementary Fig. S2a and Table 1). When using the fractal globule module to estimate contact frequency only based on distance, gABC achieved less improvement over the regular ABC with an average AUPRC gain of 0.048 (Supplementary  Table S1). We could reproduce the higher accuracy of the gABC score in a direct comparison to the implementation of Fulco et al.  Table S2). Further, we examined the correlation between each of the ABC scoring approaches and the absolute change in gene expression as measured in the CRISPRi screens. The gABC score showed a higher Spearman correlation coefficient across all three datasets (Supplementary Table S3).
To disentangle for which enhancers the gABC score performs better than the regular ABC, we focused on the largest CRISPRi screen from Gasperini et al. (2019) (Supplementary Fig. S3a) and found that the regular ABC predicted more false positive target genes for enhancers with a high activity (Fig. 1e and f and Supplementary Fig. S3b-e). There was a small subset of enhancers in gene-rich regions for which the gABC score predicted more false positive interactions ( Supplementary Fig. S3f-i).
Enformer is a novel method that predicts gene expression and chromatin states directly from the DNA sequence using a complex neural network architecture outperforming other sequence-based models (Avsec et al., 2021). We compared the gABC score and an Enformer model learned on K562 cells using in silico mutagenesis, where the strength of interactions was quantified by the predicted expression change upon sequence perturbation of the enhancer. gABC achieved a higher accuracy on all three validation datasets (Fig. 1b) testing alternative ways for the in silico mutagenesis (Supplementary Table S4). This was also reflected in significantly higher areas under the ROC curves (all P-values 0.05, DeLong et al., 1988;Robin et al., 2011).
Although it uses the same information, the gABC score performed better than the regular ABC score and outperformed the accurate sequence-based Enformer model.

One activity assay yields similar accuracy
The original formulation of the ABC score requires two types of assays to measure enhancer activity, namely DNase-seq and H3K27ac ChIP-seq (Fulco et al., 2019). We tested the performance of the gABC scoring principle using different assays for measuring enhancer activity ( Fig. 1a and d, Table 1 and Supplementary Fig.  S2b). On the datasets of Schraivogel and Fulco, the combination of DNase-seq and H3K27ac ChIP-seq performed better than either of them alone, although the performance drop with only DNase-seq was small. On the Gasperini data, DNase-seq without H3K27ac ChIP-seq was slightly better. Quantifying enhancer activity with ATAC-seq resulted in a worse performance on the Schraivogel and Fulco validation data, but a clear improvement on the Gasperini dataset, surpassing the combination of DNase-seq and H3K27ac ChIP-seq.

Regulatory interactions in single-cell data
Using just one epigenome assay for the ABC score enables direct application on high-resolution snATAC-seq data, which has the potential to improve the prediction of CS interactions. As a proof of concept, we analysed a human heart snATAC-seq dataset, comprising eight cell type clusters (Hocker et al., 2021). The candidate enhancers of the cell types were pooled to a set of 286 777 regions with a summarized ATAC-seq measurement for each defined cell type (Fig. 2a). On average we predicted 408 846 gABC-scored interactions (SD % 12 200) over all cell types. Approximately 23.6% of a cell type's interactions were shared with all other cell types ( Fig. 2b and Supplementary Fig. S4a). Each cell type also featured unique interactions although this was highly variable (l % 11.6%, SD % 5.1%). Atrial cardiomyocytes (aCM) and ventricular cardiomyocytes (vCM) formed the largest intersection of interactions found in only two cell types, consisting of 39 707 interactions. The average median of enhancers per expressed gene (TPM ! 0.5) across cell types was 4.75 (SD % 0.43) (Supplementary Fig. S4b). Despite all cell types having the same candidate enhancers and a shared contact measurement, their predicted enhancer-gene interactions appeared to be considerably distinct.

gABC score recovers more eQTL-gene pairs
We examined how many eQTLs from different tissues are recovered by ABC interactions from matching cell types, including four heart cell types from Hocker et al. (2021) and K562 cells. We took high confidence eQTL-gene pairs from three different fine-mapping methods and compared which fraction of enhancer-gene interactions supported by eQTLs were also found by the 300 000 highest scored ABC and gABC interactions (Fig. 2c). The gABC interactions recovered significantly more eQTL-gene pairs across all finemapping methods and eQTL datasets (P-value % 0.0005 Wilcoxon signed-rank test). This finding was reproduced for the 100 000 and 200 000 highest scored interactions. The gABC interactions also captured more eQTL-gene pairs than interactions derived via co-accessibility analysis (Supplementary Fig. S4c).

Linking epigenetic features to CS expression
We trained CS gene expression prediction models based on gene-TF affinity matrices, constructed with different approaches (Fig. 3a). Using the gABC score performed best across all cell types (Pearson correlation coefficient l % 0.563), with similar results using the average Hi-C matrix (l % 0.556). The regular ABC score had a slightly lower performance (l % 0.538), whereas interactions identified via co-accessibility resulted in the lowest performance (l % 0.505). Since the co-accessible interactions were limited to a distance of 250 kb, we also tested the gABC score in a 500-kb window, which marginally decreased the performance in the prediction model (l % 0.562, Supplementary Fig. S5a).
We defined a set of CS genes (TPM ! 0.5 and z-score ! 2) for each cell type (l ¼ 1404 genes, SD % 582 genes) and analysed those in more detail. The CS genes were mostly unique to a cell type ( Supplementary Fig. S4d) and GO term enrichment returned cell-type appropriate terms (Supplementary Fig. S4e). To further characterize the sets of CS genes, we examined additional attributes and compared CS genes with non-CS genes, both sets restricted to expressed genes (TPM ! 0.5) (Fig. 3b and Supplementary Fig. S4f). CS genes had more assigned enhancers than non-CS genes in all cell types except for LC. This matches the finding of Fulco et al. (2019), who described a higher number of enhancers for tissuespecific than for ubiquitously expressed genes. Furthermore, CS genes tended to have a higher percentage of unique interactions, meaning the fraction of interactions of a gene, that were exclusively found in that cell type, was higher. Most cell types had a slightly higher average activity in enhancers linked to CS genes than in enhancers linked to non-CS genes, except for aCM and FB. There was no clear trend visible for the average contact frequency or TSS distance of the assigned enhancers. To exclude that the differences were solely caused by the CS genes' higher expression, we repeated . 'equal FP' contains all enhancers where the number of FP target genes is the same for both scores (0 FP included). 'more FP' means that either of the scores called more FP target genes for that enhancer. The number of enhancers in each category is shown below the x-axis labels the comparisons but restricted them to CS and non-CS genes from the upper quartile of TPM values. The results were highly similar across all features (data not shown). We repeated the training of gene expression models but limited it to CS genes and observed an overall decrease in prediction accuracy ( Supplementary Fig. S5b). The gABC score still performed better than the ABC score. The prediction model returns a regression coefficient for each TF, indicative of TF relevance, and allows to investigate which TFs might drive gene expression in which cell type ( Supplementary Fig. S5c).
Overall, we were able to characterize CS expression regulation based on single-cell chromatin accessibility and bulk chromatin contact data.

Runtime of STARE
We optimized STARE's runtime by allowing multiple steps to be run in parallel (Fig. 4a) and to omit redundant calculations, if multiple cell types/metacells/individual cells with a respective activity measurement are processed. The runtime per activity column decreases when multiple columns are handled in the same run, allowing large datasets to be processed in a few minutes (Fig. 4b).

Discussion
We present a variation of identifying regulatory enhancer-gene interactions with the ABC model. In its original study, the ABC  Recall is the fraction of enhancer-gene pairs found by each score out of all pairs where the enhancer contained an eQTL whose target gene was within the window size used for ABC scoring. The 300 000 highest scored interactions of ABC and gABC were used model already showed a better accuracy for detecting validated enhancer-gene links than other approaches (Fulco et al., 2019). We propose a gABC score where the activity of an enhancer is described in a gene-specific manner by weighting it relative to the number of enhancer-gene contacts and where all TSSs of a gene are considered. This generalization resulted in an improvement in identifying experimentally validated enhancer-gene interactions, both compared with the regular ABC, as well as to the deep learning model Enformer. It should be noted however, that Enformer was built to predict gene expression, and not to identify enhancer-gene links (Avsec et al., 2021). The accuracy of different scoring approaches was partially inconsistent between validation datasets, especially when using ATAC-seq for enhancer activity. Although all datasets are based on a dCas9-KRAB system, there are differences in the experimental set-ups and scope. Gasperini et al. (2019) introduced multiple guide RNAs and measured expression via single-cell RNA-seq, allowing quantification of interactions for over 10 000 genes. Schraivogel et al. (2020) presented their method TAP-seq, while Fulco et al. (2019) published CRISPRi-FlowFISH and collected previous results, both evaluating interactions for less than one hundred genes. Details in candidate enhancer selection, filtering steps and processing might lead to biases or varying sensitivity in the identification of enhancergene pairs, which is already indicated by the different fractions of significant interactions. Further, it is debatable whether CRISPRi screens are able to detect all regulatory interactions, as true enhancers with small effect sizes may be overlooked. Moreover, perturbations of individual enhancers are presumably not capable of accounting for shadow enhancers with redundant functionality (Schoenfelder and Fraser, 2019;Singh and Yi, 2021). More large-scale validation data could consolidate and contextualize our findings, but are currently unavailable. Importantly, any model to annotate enhancer-gene interactions is only a prediction and likely not capturing the whole regulatory complexity of genes. The ABC model requires two data types, which makes it applicable in a range of scenarios, but it might also miss out relevant epigenetic information. Further, the model assumes all genes are regulated in the same way.
In context of required data types, we were able to demonstrate that, unlike suggested in the original work, one assay for measuring enhancer activity works similarly well, specifically using DNase-or ATAC-seq without H3K27ac ChIP-seq data. Further, using averaged contact data yielded a high performance as well, which broadens the applicability of the ABC score to all datasets, where a measurement of enhancer activity is available. Especially for singlecell epigenomics, it is challenging to measure multiple modalities in the same cell.
We present the STARE framework to derive gene-TF affinities. After mapping candidate enhancers to genes, using either the ABC score or a window-based approach, STARE summarizes TF affinities on a gene level. Unlike other methods aiming to determine regulatory relations of TFs to genes (Lan et al., 2012;Wang et al., 2013), STARE does not require scarce TF ChIP-seq data. It uses a motif-based biophysical model (Roider et al., 2007) to determine TF affinities in accessible regions. As consequence, unlike other methods (McLeay et al., 2012), it is able to retain low affinity binding information. Patel and Bush (2021) use similar data as STARE and rely on a graph-based approach, but they do not incorporate active regulatory regions and analysis is limited to a window marked by the most distant CTCF peaks within 50 kb of the gene body. With the ABC model, STARE specifically determines candidate enhancers within any selected window size. Notably, our model assumes an additive influence of TFs, which is not likely to accurately capture biological reality (Zeitlinger, 2020). Furthermore, there is potential redundancy on two levels: the aforementioned functional redundancy of enhancers and the redundancy of TF binding motifs (Cusanovich et al., 2014;Gitter et al., 2009).
We applied our framework to single-cell data with clustered cell types of the human heart (Hocker et al., 2021), where the activity of a unified set of candidate enhancers was measured for each cell type.
The assumption was that cell type specificity is mainly driven by activity of regulatory regions and less by spatial chromatin conformation. Although chromatin contacts were also found to change upon cell differentiation (Fraser et al., 2009;Zhang et al., 2020), the 3D conformation of the genome is described as less dynamic and more as a scaffold to enable and stabilize regulatory interactions (Ing-Simmons et al., 2021;Schoenfelder and Fraser, 2019). We were able to unravel differences in regulatory interactions across cell types and to characterize regulation of CS genes, despite using bulk chromatin contact data. We demonstrated a downstream application example of STARE with a linear expression model, which allows to identify candidate regulatory TFs for further evaluation. Considering the small training sets and that the model assumes the same regulation for all genes, the prediction yielded a reasonable performance. Pearson correlation coefficient is shown as average over a 10-fold outer cross-validation. A horizontal line above the bars indicates significance (P-value 0.05, Mann-Whitney U test). snATAC-seq data were used as enhancer activity for all approaches. ABC/gABC H3K27ac HiChIP, regular/gABC scoring with H3K27ac HiChIP as contact data; gABC avg Hi-C, gABC with an average Hi-C matrix as contact data; Co-accessibility, enhancer-gene links defined by Cicero (Pliner et al., 2018), see Section 2.5.1. The respective Spearman correlation coefficients, as well as additional approaches and their performance in a training on CS genes only, are presented in Supplementary Figure S5. (b) Comparison of attributes between CS (TPM ! 0.5 and z-score ! 2) and not CS (non-CS) genes (TPM ! 0.5) (*P-value 0.05, Mann-Whitney U test). See Figure 2 for cell type abbreviations  (Hocker et al., 2021) with 286 777 candidate enhancers, for 55 765 genes in a 5-MB window. In total, 818 TFs were assessed. The bars' height represents the mean of three runs Schmidt et al. (2020) used a similar expression prediction approach and incorporated distant enhancers. However, their work relies on annotated loops, which are not likely to cover all relevant regulatory interactions. In addition, their strategy is not able to integrate contact frequencies into TF affinities, nor to derive CS interactions from bulk contact data. There are other tools for predicting gene expression explicitly in individual cells that incorporate TF information, such as SCENIC (Aibar et al., 2017), ACTION (Mohammadi et al., 2018) or TRIANGULATE (Behjati Ardakani et al., 2020), but none of them considers long-range enhancer-gene interactions. A compelling approach would be to combine these expression prediction tools with information on distant enhancers. STARE represents a currently unique form of deriving TF affinities on a gene level: it combines enhancer-gene links called by the ABC score with a non-hit-based TF annotation. Prospectively, we would like to apply STARE on individual cells instead of clustered cell types, which would require additional steps to account for the drastic sparsity of most single-cell measurements. It would also be highly interesting to further investigate the importance of chromatin contacts in cell type specificity, once the resolution and availability for such data are advanced.

Data availability
The presented results are provided via Zenodo (https://doi.org/10.5281/zen odo.5841991). All data are in hg19. For details on the used data, see the Supplementary Material.